Title

Model Adaptation System and Method for Speaker Recognition

Field of the Invention 

The present invention generally relates to a system and method for
speaker recognition. In particular, although not exclusively, the present
invention relates to speaker recognition incorporating Gaussian Mixture Models
to provide robust automatic speaker recognition in noisy communications
environments, such as over telephony networks and for limited quantities of
training data.

Discussion of Prior Art

In recent years, the interaction between computing systems and humans
has been greatly enhanced by the use of speech recognition software.
However, the introduction of speech based interfaces has presented the need
for identifying and authenticating speakers to improve reliability and provide
additional security for speech based and related applications.

Various forms of speaker recognition systems have been utilised in such
areas as banking and finance, electronic signatures and forensic science. An
example of one such system is that disclosed in International Patent Application
WO 99/23643 by T-Netix, Inc entitled 'Model adaptation system and method for
speaker verification'. The T-Netix document describes a system and method for
adapting speaker verification models to achieve enhanced performance during
verification and, in particular, a sub-word based speaker verification system
having the capability of adapting a neural tree network (NTN), Gaussian mixture
model (GMM), dynamic time warping template (DTW), or combinations of the
above, without requiring additional time consuming retraining of the models.

Another example of a speaker recognition system is disclosed in US
Patent No. 6,088,699 by Maes (assigned to IBM) and is entitled 'Speech
recognition with attempted speaker recognition for speaker model pre-fetching
or alternative speech modelling'. Maes describes a system of identifying a
speaker by text-independent comparison of an input speech signal with a stored
representation of speech signals corresponding to one of a plurality of
speakers. The method of speaker recognition proposed by Maes utilises Vector
Quantisation (VQ) scoring.

US Patent No. 6,411,930 by Burges (assigned to Lucent Technologies
Inc) entitled 'Discriminative Gaussian mixture models for speaker verification'
discloses a method of speaker recognition that utilises a Discriminative
Gaussian mixture model (DGMM). A likelihood sum of the single GMM is
factored into two parts, one of which depends only on the Gaussian mixture
model, and the other of which is a discriminative term. The discriminative term
allows for the use of a binary classifier, such as a Support Vector Machine
(SVM).

Another example of speaker recognition is discussed in US Patent No.
6,539,351 by Chen et al (assigned to IBM) and entitled 'High dimensional
acoustic modelling via mixtures of compound Gaussians with linear transforms'.
Chen describes a method of modelling acoustic data with a combination of a
mixture of compound Gaussian densities and a linear transform. All the
methods disclosed for training the model combined with the linear transform
utilise the Expectation Maximization (EM) method using an auxiliary function to
maximise the likelihood.

The systems described above do not provide a speaker recognition
algorithm which performs reliably under adverse communications conditions,
such as limited enrolment speech, channel mismatch, speech degradation and
additive noise, which typically occur over telephony networks.

It would be advantageous if a system and method of speaker recognition
could be provided that is robust and would mitigate the effects of adverse
communications conditions, such as channel mismatch, speech degradation
and noise, while also enhancing speaker model estimation.

Summary of the Invention 

In one aspect of the present invention there is provided a method of
speaker modelling, said method including the steps of:

estimating a background model based on a library of acoustic data from 
a plurality of speakers representative of a population of interest; 



training a set of Gaussian mixture models (GMMs) from constraints
provided by a library of acoustic data from a plurality of speakers representative
of a population of interest and the background model;

estimating a prior distribution of speaker model parameters using 
information from the trained set of GMMs and the background model, wherein 
correlation information is extracted from the trained set of GMMs; 

obtaining a training sequence from at least one target speaker;

estimating a speaker model for each of the target speakers wherein the 
speaker model estimation further includes: 

• estimating the speaker model using a GMM structure based on the 
maximum a posteriori (MAP) criterion, wherein the MAP criterion is 
a function of the training sequence and the estimated prior 
distribution. 

In another aspect of the present invention there is provided a system for 
speaker modelling, said system including: 

a library of acoustic data relating to a plurality of background speakers, 
representative of a population of interest; 

a library of acoustic data relating to a plurality of reference speakers, 
representative of a population of interest; 

a database containing training sequence(s), said training sequence(s)
relating to one or more target speaker(s);

a memory for storing a background model and a speaker model for said 
one or more target speakers; and 

at least one processor coupled to said library, database and memory, 
wherein said at least one processor is configured to: 

• estimate a background model based on a library of acoustic data
from a plurality of background speakers;

• train a set of Gaussian mixture models (GMMs) from a library
of acoustic data from a plurality of reference speakers and the
background model;

• estimate a prior distribution of speaker model parameters using
information from the trained set of GMMs and the background
model, wherein correlation information is extracted from the
trained set of GMMs;

• estimate a speaker model for said one or more target
speaker(s), wherein the speaker model estimation further
includes:

■ estimating the speaker model using a GMM structure
based on the maximum a posteriori (MAP) criterion,
wherein the MAP criterion is a function of the training
sequence and the estimated prior distribution; and

• store said background model and said speaker model in said
memory.

In a further aspect of the present invention there is provided a method of 
speaker recognition, said method including the steps of: 

estimating a background model based on a library of acoustic data from 
a plurality of background speakers; 

training a set of Gaussian mixture models (GMMs) from a library of 
acoustic data from a plurality of reference speakers and the background model; 

estimating a prior distribution of speaker model parameters using 
information from the trained set of GMMs and the background model, wherein 
correlation information is extracted from the trained set of GMMs; 

obtaining a training sequence from at least one target speaker; 

estimating a target speaker model for each of the target speakers 
wherein the speaker model estimation further includes: 

• estimating the speaker model using a GMM structure based on the 
maximum a posteriori (MAP) criterion, wherein the MAP criterion is 
a function of the training sequence and the estimated prior 
distribution;

obtaining a speech sample from a speaker; 

evaluating a similarity measure between the speech sample and the 
target speaker model and between the speech sample and the background 
model; and 

identifying whether the speaker is one of said target speakers by
comparing the similarity measures between the speech sample and said target
speaker model and between the speech sample and the background model.
Other normalisations at the feature, model and score levels may also be applied
to the said system.

In still yet another aspect of the present invention there is provided a
system for speaker modelling and verification, said system including:

a library of acoustic data relating to a plurality of background speakers, 
representative of a population of interest; 

a library of acoustic data relating to a plurality of reference speakers, 
representative of a population of interest; 

a database containing training sequences, said training sequences
relating to one or more target speakers;

an input for obtaining a speech sample from a speaker; 

a memory for storing a background model and a speaker model for said 
one or more target speakers; and 

at least one processor wherein said at least one processor is configured
to:

• estimate a background model based on a library of acoustic data
from a plurality of background speakers;

• train a set of Gaussian mixture models (GMMs) from a library
of acoustic data from a plurality of reference speakers and the
background model;

• estimate a prior distribution of speaker model parameters using
information from the trained set of GMMs and the background
model, wherein correlation information is extracted from the
trained set of GMMs;

• estimate a speaker model for said one or more target speaker(s),
wherein the speaker model estimation further includes:

■ estimating the speaker model using a GMM structure
based on the maximum a posteriori (MAP) criterion,
wherein the MAP criterion is a function of the training
sequence and the estimated prior distribution;

• store said background model and said speaker model in said
memory;

• obtain a speech sample from a speaker;

• evaluate a similarity measure between the speech sample and the
target speaker model and between the speech sample and the
background model;

• verify if the speaker is a target speaker by comparing the similarity
measures between the speech sample and the target speaker
model and between the speech sample and the background model;
and

• grant access to the speaker if the speaker is verified as the target
speaker.

Preferably a library of correlation information is produced from the trained
set of GMMs and the estimation of prior distribution of speaker model
parameters is based on the library of correlation information and the
background model. Most preferably, the library of correlation information
includes the covariance of the mixture component means extracted from the
trained set of GMMs. A prior covariance matrix of the component means may
then be compiled based on this library of correlation information.

If required, an estimate of the prior covariance of the mixture component
means may be determined by the use of various methods such as maximum
likelihood, Bayesian inference of the correlation information using the
background model covariance statistics as prior information, or reducing the
off-diagonal elements.

The background speakers and reference speakers may be
representative of, but not limited to, persons of selected ages, genders and/or
cultural backgrounds.

The library of acoustic data used to train the set of GMMs is preferably
independent of that used to estimate the background model, i.e. no speaker
should appear in both the set of background speakers and reference speakers.
Additionally, a target speaker must not be a background speaker or reference
speaker.

Preferably, the evaluation of the similarity measure involves the use of
the expected frame-based log-likelihood ratio.

The background model may also directly describe elements of the prior
distribution. Preferably, the present invention utilises full target and background
model coupling.

The estimation of the prior information (in the form of the speaker model
component mean prior distribution) may involve a single pass approach.
Alternatively, it may involve an iterative approach whereby the library of
reference speaker models is re-trained using an estimate of the prior
distribution and the prior distribution is subsequently re-estimated. This process
is then repeated until a convergence criterion is met.

The speech input for both training and testing may be directly recorded
or may be obtained via a communication network such as the Internet, local or
wide area networks (LANs or WANs), GSM or CDMA cellular networks, Plain
Old Telephone System (POTS), Public Switched Telephone Network (PSTN),
Integrated Services Digital Network (ISDN), various voice storage media, a
combination thereof or other appropriate source.

The speaker verification and identification may further include
post-processing techniques such as feature warping, feature mean and variance
normalisation, RASTA, modulation spectrum processing and Cepstral Mean
Subtraction or a combination thereof to mitigate speech channel effects.

Brief Details of the Drawings

In order that this invention may be more readily understood and put into
practical effect, reference will now be made to the accompanying drawings,
which illustrate preferred embodiments of the invention, and wherein:

FIG. 1 is a schematic block diagram illustrating the background model
estimation process;

FIG. 2 is a schematic block diagram illustrating the process of obtaining a
component mean covariance matrix in accordance with one embodiment of the
invention;

FIG. 3 is a schematic block diagram illustrating speaker model estimation
for a given target speaker in accordance with one embodiment of the invention;

FIG. 4 is a schematic block diagram illustrating speaker verification in
accordance with one embodiment of the present invention;

FIG. 5 is a plot of Detection Error Trade off (DET) curves according to
one embodiment of the present invention; and

FIG. 6 is a plot of the Equal Error Rates (EER) according to one
embodiment of the present invention.

Description of the Preferred Embodiments

In one embodiment of the invention there is provided a method of
speaker modelling whereby prior speaker information is incorporated into the
modelling process. This is achieved through utilising the Maximum A Posteriori
(MAP) algorithm and extending it to contain prior Gaussian component
correlation information.

This type of modelling provides the ability to model mixture component
correlations by observing the parameter variations between a selection of
speaker models. Previous speaker recognition modelling work in the prior art
assumed that the adaptation of each mixture component mean was
independent of other mixture components.

With reference to Figure 1, there is illustrated the first stage in the
modelling process of one embodiment of the present invention. Estimating a
background model 10 for speaker recognition may be performed in accordance
with various methods, which are well known in the art. In the present case, the
Expectation Maximisation (EM) algorithm is used to produce the background
model. Pooled acoustic reference data 11 relating to a specific demographic of
speakers (population of interest) from a given total population is trained via the
EM algorithm 12 to produce a background model 13 which is a general
representation of the speech characteristics of the population of interest and is
typically a large order Gaussian Mixture Model (GMM).

Figure 2 depicts the second stage of the modelling process utilised by an
embodiment of the present invention. The background model 13 is adapted
utilising information from a plurality of reference speakers 21 in accordance with
the Maximum A Posteriori (MAP) criterion 22. The reference speaker
information within this stage of the process is composed of data samples, which
represent the population of interest. However, this reference speaker
information differs from the pooled acoustic reference data 11 used to obtain
the background model in that it relates to a second group of speakers from the
same demographic (i.e. no sample overlap). This preserves the statistical
independence of the modelling process.

Utilizing MAP estimation, the reference speaker data and prior
information obtainable from the background model parameters are combined to
produce a library of adapted speaker models, namely Gaussian Mixture Models
23.

Using the Bayesian Inference approach, the model parameter set $\lambda$ for a
single model is optimized according to the MAP estimation criterion given a speech
utterance $X$. The MAP optimization problem may be represented as follows.

$$\lambda_{MAP} = \arg\max_{\lambda}\; p(X \mid \lambda)\, p(\lambda) \qquad \text{(Eq. 1)}$$

One approach is to have $p(X \mid \lambda)$ described by a mixture of Gaussian component
densities, while $p(\lambda)$ is established as the joint likelihood of $w_i$, $\mu_i$ and $\Sigma_i$, being
the weights, means and diagonal covariances of the Gaussian components
respectively. The fundamental assumption specified by the prior information,
without consideration of the mixture component weight effects, is that all mixture
components are independent. Thus $p(\lambda)$ could be represented as the product
of the joint GMM weight likelihood with the product of the individual component
mean and covariance pair likelihoods as given by equation (2).

$$p(\lambda) = g(w_1, w_2, \ldots, w_N) \prod_{i=1}^{N} g(\mu_i, \Sigma_i \mid \Theta_i) \qquad \text{(Eq. 2)}$$

Here, let $g(w_1, w_2, \ldots, w_N)$ be represented as a Dirichlet distribution and
$g(\mu_i, \Sigma_i \mid \Theta_i)$ be a Normal-Wishart density. The Dirichlet density is the conjugate
prior density for the parameters of a multinomial density and the Normal-Wishart
density is the conjugate prior for the parameters of the normal density.

This form of joint likelihood calculation assumes that the probability
density function of the component weights is independent of the mixture
component means and covariances. In addition, the joint distribution of the
mean and covariance elements is independent of all other mean and
covariance parameters from other Gaussians in the mixture.

Thus, the MAP solution is solved by maximizing the following auxiliary
function defined by equation (3).

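[Equation (3): $\Psi(\lambda, \hat{\lambda}) \propto p(\lambda) \ldots$; the remaining factors are illegible in the source.]

where

$$c_{it} = \Pr(i \mid x_t, \hat{\lambda}) = \frac{\hat{w}_i\, g(x_t \mid \hat{\mu}_i, \hat{\Sigma}_i)}{\sum_{j=1}^{N} \hat{w}_j\, g(x_t \mid \hat{\mu}_j, \hat{\Sigma}_j)}$$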
This is achieved by using the Expectation-Maximization procedure to maximize
this function. Under the assumption that only the mixture component means will
be adapted, the resulting EM algorithm auxiliary function is presented in
equation (4).

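Here $\lambda$ and $\hat{\lambda}$ are the new and old model estimates as a function of the
mixture component means. The variable $c_i$ is the accumulated probability count
for mixture component $i$, that is $c_i = \sum_{t=1}^{T} c_{it}$ with

$$c_{it} = \frac{\hat{w}_i\, g(x_t \mid \hat{\mu}_i, \hat{\Sigma}_i)}{\sum_{j=1}^{N} \hat{w}_j\, g(x_t \mid \hat{\mu}_j, \hat{\Sigma}_j)},$$

and $r_i$ is the diagonal precision matrix for each Gaussian component $i$
($r_i = \Sigma_i^{-1}$). The vectors $\mu_i$ and $\hat{\mu}_i$ are the $i$th new and old adapted
Gaussian means respectively, and $\bar{x}_i = \sum_{t=1}^{T} c_{it}\, x_t / c_i$.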
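The accumulated counts $c_i$ and data means $\bar{x}_i$ defined above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming diagonal-covariance components; the function names `log_gauss_diag` and `accumulate_stats` are hypothetical and not taken from the specification.

```python
import numpy as np

def log_gauss_diag(x, means, variances):
    """Log density of every frame under every diagonal Gaussian.

    x: (T, D) frames; means, variances: (N, D). Returns a (T, N) array.
    """
    diff = x[:, None, :] - means[None, :, :]
    return -0.5 * (np.sum(diff**2 / variances, axis=2)
                   + np.sum(np.log(2.0 * np.pi * variances), axis=1))

def accumulate_stats(x, weights, means, variances):
    """Return posteriors c_it (T, N), counts c_i (N,) and means x_bar_i (N, D).

    weights, means, variances describe the old model (assumed names).
    """
    log_p = np.log(weights) + log_gauss_diag(x, means, variances)
    log_p -= log_p.max(axis=1, keepdims=True)   # stabilise the normalisation
    c_it = np.exp(log_p)
    c_it /= c_it.sum(axis=1, keepdims=True)     # posterior of component i per frame
    c_i = c_it.sum(axis=0)                      # accumulated probability counts
    x_bar = (c_it.T @ x) / np.maximum(c_i, 1e-10)[:, None]
    return c_it, c_i, x_bar
```

The small floor on $c_i$ only guards against division by zero for components that receive no data; it is a numerical convenience rather than part of the described method.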
For the purposes of the present invention it is assumed that the
distribution of the joint mixture component means is governed by a high 
dimensionality Gaussian density function. In order to represent this density, let 
the joint vector of the concatenated Gaussian means be represented as follows. 
In some works, this is described using the vec{} operator. 



$$M = \left[\, \mu_1' \;\; \mu_2' \;\; \cdots \;\; \mu_N' \,\right]' \qquad \text{(Eq. 5)}$$



Let the concatenated vector means have a global mean given by $\mu_G$ and
a precision matrix given by $r_G$. Thus, for $N$ mixture component means, with
feature dimensionality $D$, $M$ is a vector of length $ND$, while $r_G$ is an $ND$ by $ND$
square matrix. Thus the matrix $r_G^{-1}$ is comprised of $N$ by $N$ sets of $D$ by $D$
covariance blocks (with each block describing the covariance between the
corresponding $D$ parameters of the $i$th and $j$th mixture component mean vectors).
Given these conditions, the distribution of the concatenated means may be given
in full composite form such that $g(\lambda)$ is proportional to the following.

Equation (6) may be given in the following symbolic compressed form



$$g(\lambda) \propto \exp\left\{ -\tfrac{1}{2} (M - \mu_G)'\, r_G\, (M - \mu_G) \right\} \qquad \text{(Eq. 7)}$$



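In addition, the remainder of auxiliary equation (4) must be represented
in a similar matrix and vector form. The result is presented in equation (8),
where $\bar{x}$ is the concatenation of the vectors $\bar{x}_i$ and $r$ is the block diagonal
matrix formed from the precision matrices $r_i$.

$$\prod_{i=1}^{N} \exp\left\{ -\tfrac{c_i}{2} (\mu_i - \bar{x}_i)'\, r_i\, (\mu_i - \bar{x}_i) \right\} = \exp\left\{ -\tfrac{1}{2} (M - \bar{x})'\, C r\, (M - \bar{x}) \right\} \qquad \text{(Eq. 8)}$$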
The matrix $C$ is a strictly diagonal matrix of dimension $ND$ by $ND$. This
matrix is comprised of diagonal block matrices $C_1, C_2, \ldots, C_N$. Each matrix $C_i$ is
a $D$ dimensional identity matrix scaled by the mixture component accumulated
probability count $c_i$ that was defined earlier.

Given this information, the equation for maximizing the likelihood can be
determined. The equation in this form can be optimized (to the degree of finding
a local maximum) by use of the Expectation-Maximization algorithm. This gives
the following auxiliary function representation shown in equation (9).

$$\Psi(\lambda, \hat{\lambda}) \propto \exp\left\{ -\tfrac{1}{2} (M - \mu_G)'\, r_G\, (M - \mu_G) \right\} \times \exp\left\{ -\tfrac{1}{2} (M - \bar{x})'\, C r\, (M - \bar{x}) \right\} \qquad \text{(Eq. 9)}$$

Expressing this in natural logarithmic form results in equation (10).

$$\ln \Psi(\lambda, \hat{\lambda}) = -\tfrac{1}{2} (M - \mu_G)'\, r_G\, (M - \mu_G) - \tfrac{1}{2} (M - \bar{x})'\, C r\, (M - \bar{x}) + \text{constant} \qquad \text{(Eq. 10)}$$



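Taking the partial derivatives with respect to each element of $M$ gives

$$\frac{\partial \ln \Psi(\lambda, \hat{\lambda})}{\partial M} = -(C r + r_G)\, M + (C r\, \bar{x} + r_G\, \mu_G) \qquad \text{(Eq. 11)}$$

In determining the partial derivatives, the following equalities prove
useful. Here $m$ is an arbitrary variable vector and $T$ is a symmetric matrix
(i.e. $T = T'$).

$$\frac{\partial (m' T)}{\partial m} = T, \qquad \frac{\partial (T m)}{\partial m} = T, \qquad \frac{\partial (m' T m)}{\partial m} = 2\, T m$$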
In order to locate the stationary points of the auxiliary function, the
derivative expressed in equation (11) is set to zero, i.e.
$\partial \ln \Psi / \partial M = 0$. This reduces the equation to the form represented in
equation (12).

$$(C r + r_G)\, M = C r\, \bar{x} + r_G\, \mu_G \qquad \text{(Eq. 12)}$$

Solving for $M$ yields the MAP solution

$$M = (C r + r_G)^{-1} (C r\, \bar{x} + r_G\, \mu_G) \qquad \text{(Eq. 13)}$$
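The MAP solution of equation (13) is a linear system in the concatenated means, so it can be solved directly. The sketch below is illustrative only: it assumes diagonal component precisions stored as an (N, D) array, and the function name `coupled_map_means` is hypothetical.

```python
import numpy as np

def coupled_map_means(c, r_diag, x_bar, mu_g, r_g):
    """Solve (C r + r_G) M = C r x_bar + r_G mu_G for the adapted means.

    c:      (N,)      accumulated probability counts c_i
    r_diag: (N, D)    diagonals of the component precision matrices r_i
    x_bar:  (N, D)    data means for each component
    mu_g:   (N*D,)    global mean of the concatenated means
    r_g:    (N*D, N*D) global precision matrix of the concatenated means
    """
    n, d = x_bar.shape
    cr = np.repeat(c, d) * r_diag.reshape(-1)   # diagonal of the matrix C r
    lhs = np.diag(cr) + r_g
    rhs = cr * x_bar.reshape(-1) + r_g @ mu_g
    # A linear solve avoids forming the explicit inverse in Eq. 13.
    return np.linalg.solve(lhs, rhs).reshape(n, d)
```

When `r_g` is diagonal the components decouple and each mean is adapted independently; off-diagonal entries are what allow data observed for one component to move the means of the others.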

This is reducible into the form of a weighted contribution of prior and new
information.

$$M = \alpha_M\, \bar{x} + (I - \alpha_M)\, \mu_G \qquad \text{(Eq. 14)}$$

where $\alpha_M = (C r + r_G)^{-1} C r$
and $(I - \alpha_M) = (C r + r_G)^{-1} r_G$.

Now given that the global mean $\mu_G$ is set to the concatenated
background model means, the factor $\alpha_M$ contains information relating to the
proportion of new to old information contained in the background model that is
to be included in the adaptation process.

Now that the adaptation equation is capable of handling the prior
correlation information within the MAP adaptation framework, one method for
determining the global correlation components is the Maximum Likelihood
criterion. The Maximum Likelihood criterion estimates the covariance matrix
through the parameter analysis of a library of Out-Of-Set (OOS) speaker
models. If the correlation components describe the interaction between the
mixture mean components appropriately, the adaptation process can be
controlled to produce an optimal result. The difficulty with the data based
approach is the accurate estimation of the unique parameters in the $ND$ by $ND$
covariance matrix. For a complete description of the matrix, at least $ND+1$
unique samples are required to avoid a rank deficient matrix or density function
singularity. This implies that at least $ND+1$ speaker models are required to
satisfy this constraint. This requirement alone can be prohibitive in terms of
computation and speech resources. For example, a 128-component GMM with 24
dimensional features requires at least 3073 well-trained speaker models to
calculate the prior information.
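The count quoted above follows directly from the size of the concatenated mean vector. A minimal check, with a hypothetical helper name:

```python
def min_reference_models(n_components, feature_dim):
    """Smallest number of speaker models giving a full-rank covariance
    estimate of the N*D concatenated means (N*D + 1 samples)."""
    return n_components * feature_dim + 1

print(min_reference_models(128, 24))  # 128 * 24 + 1 = 3073
```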

The Maximum Likelihood solution involves finding the covariance
statistics using only the out-of-set speaker models. So, if there are $S^{oos}$
out-of-set models trained from a single background model, with the concatenated
mean vector extracted from the $j$th model given by $\mu_j^{oos}$, the covariance matrix
estimate, $\Sigma_G^{ML}$, is simply calculated with equation (15). If the estimate for the
mean is known, then equation (16) need not be used. Such an example is
where the background component means are substituted for $\mu_G$.

$$\Sigma_G^{ML} = \frac{1}{S^{oos} - 1} \sum_{j=1}^{S^{oos}} (\mu_j^{oos} - \mu_G)(\mu_j^{oos} - \mu_G)' \qquad \text{(Eq. 15)}$$

with

$$\mu_G = \frac{1}{S^{oos}} \sum_{j=1}^{S^{oos}} \mu_j^{oos} \qquad \text{(Eq. 16)}$$
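As an illustration of equations (15) and (16), the covariance of the concatenated means can be computed from a stack of reference-model mean vectors. This is a sketch under the assumption that each row of `supervectors` holds one model's concatenated means; the function and argument names are hypothetical.

```python
import numpy as np

def component_mean_covariance(supervectors, mu_g=None):
    """Estimate the covariance of the concatenated component means.

    supervectors: (S, N*D) array, one concatenated mean vector per
                  out-of-set reference model.
    mu_g:         optional known global mean (for example the background
                  model means); if omitted, the sample mean is used (Eq. 16).
    """
    s = supervectors.shape[0]
    if mu_g is None:
        mu_g = supervectors.mean(axis=0)
    dev = supervectors - mu_g
    sigma_ml = dev.T @ dev / (s - 1)            # Eq. 15
    return sigma_ml, mu_g
```

With fewer than $ND+1$ rows the returned matrix is rank deficient, which is the situation the following paragraphs address.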

Unfortunately, if there are insufficient models to represent the covariance
matrix, the matrix becomes rank deficient and no inverse can be determined.
This difficulty of a rank-deficient covariance matrix is shared with subspace
adaptation approaches such as "eigenvoice" analysis that are applied in both
speech and speaker recognition. This difficulty may be resolved through a
number of methods described below, which are also applicable to eigenvoice
analysis.

One method involves Principal Component Analysis (PCA). This
approach involves decomposing the matrix representation into its principal
components. Once the principal components have been extracted, they may be
used in conjunction with (empirical, data-derived or other) diagonal covariance
information for adaptation. Restricting adaptation solely to this lower
dimensional principal component subspace likewise restricts the capability for
adapting model parameters outside the subspace. This causes performance
degradation for larger quantities of adaptation data, which may be alleviated by
using a combined approach. Ideally, a technique that can exploit some of the
significant principal components of variation information with other adaptation
statistics may operate robustly for both short and lengthy training utterances. In
this manner, the principal components may restrict the adaptation to a
subspace for small quantities of speech and will converge to the maximum
likelihood solution for larger recordings.

Another solution for avoiding the generation of a singular covariance
matrix, but not necessarily limited to this, is to reduce the magnitude of the
non-diagonal covariance components. This approach allows the inverse of the
matrix to be determined. It also permits the covariance matrix to allow
adaptation of the target model parameters outside the adaptation subspace
defined by the OOS speaker variations. The covariance estimation, given that
the global mean is known, is performed using equation (17). Here diag{}
represents the diagonal covariance matrix and $\xi$ is generally a small number
near zero but between zero and one.
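One way to reduce the off-diagonal components can be sketched as follows. The exact form of equation (17) is not reproduced here; the blend below, which keeps the diagonal and scales every off-diagonal entry by a factor `xi`, is an assumption made purely for illustration, as is the function name.

```python
import numpy as np

def shrink_off_diagonal(sigma, xi):
    """Scale the off-diagonal entries of sigma by xi (0 <= xi < 1).

    Assumed form: xi * sigma + (1 - xi) * diag(sigma). The diagonal is
    unchanged, and the result is invertible whenever every diagonal
    entry of sigma is positive.
    """
    diagonal_part = np.diag(np.diag(sigma))
    return xi * sigma + (1.0 - xi) * diagonal_part
```

Because the diagonal part is positive definite and the full matrix is positive semi-definite, the blend is positive definite for any `xi` below one, even when `sigma` was estimated from too few models.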

Another possible method for determining the global correlation
components is Bayesian adaptation of the covariance and (if required) the
mean estimates by combining the old estimates from the background model
with new information from a library of reference speaker models. The reference
speaker data library is comprised of $S^{oos}$ out-of-set speaker models
represented by the set of concatenated mean vectors, $\{\mu_j^{oos}\}$. In addition, the old
mean and covariance statistics are given by $\mu_G^{old}$ and $\Sigma_G^{old}$ respectively.

[Equations (19) to (21) are not recoverable from the source text.]

If the global mean vector estimate is known, then the adapted and old mean
estimates are both equal to it. One estimate may be to set these parameters to
the background model mean vector. In the instance that the mean of the
Gaussian distribution is known, and only the covariance information is adapted,
the adapted covariance becomes equation (22).

$$\Sigma_G^{MAP} = \xi\, \Sigma_G^{ML} + (1 - \xi)\, (\tau r)^{-1} \qquad \text{(Eq. 22)}$$

The prior estimate of the global covariance, according to standard adaptation
techniques, is given by $(\tau r)^{-1}$ while the new information is supplied by the
covariance statistics determined from the collection of OOS speaker models.
The hyperparameter $\tau$ is the relevance factor for the standard adaptation
technique and the matrix $r$ is the diagonal concatenation of the Gaussian
mixture component precision matrices. The variable $\xi$ is a tuning factor that
represents how important the sufficient statistics, which are derived from the ML
trained OOS models, are relative to the UBM based diagonal covariance
information. Now, if the OOS model derived covariance information is
unreliable, $\xi$ should reduce to 0. In this case, the adaptation equation then
resolves into the basic coupled mixture component mean adaptation system, i.e.
$M = (C r + r_G)^{-1} (C r\, \bar{x} + r_G\, \mu_G)$ becomes $M = (C + \tau I)^{-1} (C \bar{x} + \tau \mu_G)$. However, as the
value of $\xi$ increases, the emphasis on using covariance information derived
from the multiple OOS speaker models is increased. The strength of MAP
estimation of the covariance statistic is that the adapted covariance matrix will
not be rank deficient provided the old covariance information is of full rank and
$\xi$ is less than 1.
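The behaviour described for equation (22) can be checked numerically: with the tuning factor at zero the prior precision becomes the relevance factor times the component precisions, and the adapted means reduce to the familiar count-weighted average of data and background means. The sketch below assumes diagonal component precisions passed as flat vectors of length $ND$; both function names are hypothetical.

```python
import numpy as np

def adapted_prior_covariance(sigma_ml, r_diag, tau, xi):
    """Eq. 22: xi * Sigma_ML + (1 - xi) * (tau * r)^-1.

    r_diag is the diagonal of r (length N*D), tau the relevance factor.
    """
    return xi * sigma_ml + (1.0 - xi) * np.diag(1.0 / (tau * r_diag))

def map_means(c_diag, r_diag, x_bar, mu_g, sigma_prior):
    """Eq. 13 with r_G taken as the inverse of the prior covariance.

    c_diag is the diagonal of C (each count repeated D times); all
    vectors have length N*D.
    """
    r_g = np.linalg.inv(sigma_prior)
    cr = c_diag * r_diag
    return np.linalg.solve(np.diag(cr) + r_g, cr * x_bar + r_g @ mu_g)
```

For any tuning factor below one the diagonal term keeps the blended covariance invertible, even when the covariance estimated from the reference models is rank deficient.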

Thus, in accordance with the EM algorithm with the MAP criterion, the
reference speaker data $X^{oos}$ 21 is utilised to adapt the background model for
each speaker contained in the reference speaker data library to form a set of
adapted speaker models in the form of GMMs 23.

The covariance statistics of the component means are then extracted
from this adapted library of models 24 using standard techniques (see equation
15). The result of this extraction is the formation of a component mean
covariance (CMC) matrix 25. The CMC matrix may then be used in conjunction
with the background model 13 to estimate the prior distribution for controlling
the target speaker adaptation process.



With reference to figure 3, there is illustrated the third stage of the 
modelling process utilised by the present invention. The background model 13 
and the CMC matrix 25 are combined to estimate the prior distribution 31 for the 
set of component means. 

Alternatively, the CMC matrix may be used in further iterations of
reference speaker model training. In this instance the CMC data is fed back to
re-train the reference speaker models with the background model, and the CMC
matrix is then re-estimated. This joint optimization process allows for variations
of the mixture components to become dependent not only on previous iterations
but also on other components, further refining the MAP estimates. Several
criteria may be used for this joint optimization of the reference models with the
prior statistics, such as the maximum joint a posteriori probability over all
reference speaker training data, e.g.

$$\arg\max_{\mu_G,\, r_G} \sum_{j=1}^{S^{oos}} \log \max_{\lambda_j}\; p(X_j \mid \lambda_j)\, g(\lambda_j \mid \mu_G, r_G) \qquad \text{(Eq. 23)}$$

A training sequence is acquired for a given target speaker either directly
or from a network 32. For normal training of speaker recognition models at
least 1 to 2 minutes of training speech is required. This training sequence and
the prior distribution estimate 31 are then utilised in conjunction with the MAP
criterion as derived in the above discussion to estimate a speaker model for a
given target speaker 34.

The target speaker model produced in this instance incorporates model 
correlations into the prior speaker information. This enables the present 
invention to handle applications where the length of the training speech is 
limited. 

Figure 4 illustrates one possible application of the present invention,
namely that of speaker verification 40. A speech sample 41 is obtained either
directly or from a network. The sample is compared against the target model 43
and the background model 42 to produce similarity measures for the sample
against the target and background models. The similarity measure is preferably
calculated using the expected log likelihood. When comparing the likelihood
between classes, the likelihood ratio may be treated as independent of the prior
target and impostor class probabilities $P(\lambda_{tar})$ and
$P(\lambda_{non})$. The LR statistic is expressed as

$LR(x_t) = \frac{p(x_t \mid \lambda_{tar})}{p(x_t \mid \lambda_{non})}$ (Eq. 24)

For ease of mathematically manipulating the solution the logarithm is
taken, resulting in the Log Likelihood Ratio (LLR) which is given as

$LLR(x_t) = \log p(x_t \mid \lambda_{tar}) - \log p(x_t \mid \lambda_{non})$ (Eq. 25)

If the likelihoods are in fact probability densities, the likelihood ratio of a
single observation may be used to determine the target speaker probability,
given that the sample was taken from either the target or non-target speaker
distributions.

$P(\lambda_{tar} \mid x_t) = \frac{LR(x_t) P(\lambda_{tar})}{LR(x_t) P(\lambda_{tar}) + P(\lambda_{non})}$ (Eq. 26)
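
By way of illustration only, the conversion from a likelihood ratio to a target speaker probability follows from Bayes' rule. The sketch below assumes exactly two classes (target and non-target) whose prior probabilities sum to one; the function and parameter names are hypothetical and do not appear elsewhere in this specification.

```python
def target_posterior(lr, prior_target):
    """Posterior probability of the target class given a likelihood ratio.

    lr: likelihood ratio p(x | target) / p(x | non-target), assumed >= 0.
    prior_target: prior probability of the target class, in (0, 1).
    Assumes two classes only, so the non-target prior is 1 - prior_target.
    """
    prior_non = 1.0 - prior_target
    return lr * prior_target / (lr * prior_target + prior_non)


# With equal priors, a likelihood ratio of 1 leaves the posterior at the prior.
print(target_posterior(1.0, 0.5))
```

A likelihood ratio above one raises the posterior above the prior and a ratio below one lowers it, which is why the ratio itself can be assessed separately from the class priors.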

Given T observations, assumed independent and identically distributed,
$X = (x_1, x_2, \ldots, x_T)$, the ratio of the joint likelihoods in log form is
given by

$LLR(X) = \sum_{t=1}^{T} \left[ \log p(x_t \mid \lambda_{tar}) - \log p(x_t \mid \lambda_{non}) \right]$ (Eq. 27)

In practical applications, this estimate for a target speaker model figure of merit
is not a robust measure, since the observations are not independent or
identically distributed and there is a dependence between the
background model and the coupled target models. A more robust measure for
speaker verification is the expected log-likelihood ratio measure given by
equation 28. This measure is typically used in forensic casework applications
and is typically compensated for environmental effects through score
normalisation.

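$E[LLR(x_t)] = E[\log p(x_t \mid \lambda_{tar}) - \log p(x_t \mid \lambda_{non})]$ (Eq. 28)

$= \frac{1}{T} \sum_{t=1}^{T} \left( \log p(x_t \mid \lambda_{tar}) - \log p(x_t \mid \lambda_{non}) \right)$ (Eq. 29)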
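
The expected log-likelihood ratio may be sketched as follows. This is a minimal illustration only, assuming that both the target and background models are diagonal-covariance Gaussian mixture models held as (weights, means, variances) tuples; the function names, tuple layout and array shapes are assumptions made for this example.

```python
import numpy as np


def gmm_log_likelihood(x, weights, means, variances):
    """Per-frame log p(x_t | model) for a diagonal-covariance GMM.

    x: (T, D) feature frames; weights: (M,); means, variances: (M, D).
    """
    diff = x[:, None, :] - means[None, :, :]                          # (T, M, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (M,)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)  # (T, M)
    a = np.log(weights) + log_comp
    a_max = a.max(axis=1, keepdims=True)
    # Log-sum-exp over mixture components, stabilised by subtracting the max.
    return a_max[:, 0] + np.log(np.exp(a - a_max).sum(axis=1))


def expected_llr(x, target, background):
    """Average per-frame log-likelihood ratio of target versus background."""
    llr = gmm_log_likelihood(x, *target) - gmm_log_likelihood(x, *background)
    return float(np.mean(llr))
```

Averaging over the T frames, rather than summing, keeps the score on a comparable scale for test segments of different lengths.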




A similarity measure is then calculated in the above manner for the
acquired speech sample 41 compared with the background model 42 and for
the acquired speech sample compared with the speaker model of the target
person 43. These measures are then compared 44 in order to determine if the
speech sample is from the target person 45.

To demonstrate the effect of including correlation information, the 
present invention will be discussed with reference to figure 5 which represents 
the speaker detection performance of one embodiment of the present invention. 

In this instance, a fully coupled target and background model structure
was adapted using the above-described approach. Here, model coupling refers
to the target model parameters being derived from a function of the training
speech and the background model parameters. In the limit, when there is
no training speech, the target speaker model is represented by the background
model. The embodied system also utilised a feature warping parameterization
algorithm and performed scoring of a test segment via the expected log-
likelihood ratio test of the adapted target model versus the background model.
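
The coupling and its limiting behaviour can be illustrated with the classical relevance-MAP update of a single component mean. This is a simplified stand-in and not the extended MAP estimator with correlation information described above; the function name, the relevance factor of 16, the hard assignment of frames to one component and the array shapes are all assumptions made for illustration only.

```python
import numpy as np


def relevance_map_mean(background_mean, frames, relevance=16.0):
    """Classical relevance-MAP update of one component mean (illustrative).

    background_mean: (D,) mean of the background model component.
    frames: (n, D) training frames assigned to this component; n may be 0.
    relevance: hypothetical factor setting how strongly the prior is trusted.
    """
    background_mean = np.asarray(background_mean, dtype=float)
    frames = np.asarray(frames, dtype=float).reshape(-1, background_mean.shape[0])
    n = frames.shape[0]
    # Weighted blend of the data sum and the prior mean; with n == 0 the
    # result is exactly the background mean.
    return (frames.sum(axis=0) + relevance * background_mean) / (n + relevance)
```

With no frames the function returns the background mean unchanged, and as the number of frames grows the estimate moves toward the sample mean of the training speech.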

The system evaluation was based on the NIST 2000 and 1999 Speaker
Recognition Databases. Both databases provide approximately 2 minutes of
speech for the modelling of each speaker. The NIST 2000 database
represented a demographic of 416 male speakers recorded using electret
handsets. The information of the 2000 database was used to determine the
correlation statistics, while the first 5 and 20 seconds of speech per speaker in
the 1999 database were used as the training samples.

Detection Error Trade-off (DET) curves for the system are shown in
figure 5. The system curves are based on 20 second lengths of speech for a set
of male speakers processed according to the extended MAP estimation
condition, whereby the number of out-of-set (OOS) speakers was increased
for each estimation of the covariance matrix statistics. The selection of OOS
speakers involved using 20, 50, 100, 200 and 400 speakers. The result for the
baseline background model is also identified in the plot. Because the number of
OOS speakers is less than the number of rows or columns in the matrix, the
matrix is singular. To avoid this problem, the non-diagonal components of the
covariance matrix are deemphasized by 0.1%. It is clear from figure 5 that
utilising the correlation information in the modelling process yields a continued
increase in performance for an increasing number of OOS speakers used in
estimation of the covariance matrix. It is important to note that the number of
speakers is significantly below the minimum of 3073 speakers required for a
non-singular matrix estimate without the need to deemphasize the non-
diagonal covariance components. Ideally, the evaluation would use a number of
OOS speakers an order of magnitude larger. However, the improvement in
performance from using the correlation information in the modelling process is
apparent from figure 5.
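
The effect of deemphasizing the non-diagonal components can be sketched numerically. The example below assumes that "deemphasized by 0.1%" means each non-diagonal entry is scaled by 0.999 while the diagonal is left unchanged; that reading, the function name and the toy dimensions (20 speakers, 50 parameters) are assumptions for illustration and are not taken from the evaluation itself.

```python
import numpy as np


def deemphasize_off_diagonal(cov, factor=0.001):
    """Scale non-diagonal entries by (1 - factor); keep the diagonal as is."""
    scaled = cov * (1.0 - factor)
    np.fill_diagonal(scaled, np.diag(cov))
    return scaled


rng = np.random.default_rng(0)
samples = rng.standard_normal((20, 50))   # fewer speakers than dimensions
cov = np.cov(samples, rowvar=False)       # sample covariance, rank deficient
fixed = deemphasize_off_diagonal(cov)

print(np.linalg.matrix_rank(cov), np.linalg.matrix_rank(fixed))
```

Under this reading the adjusted matrix equals (1 - factor) times the original plus factor times its diagonal, so it is positive definite whenever every variance on the diagonal is positive, even when far fewer speakers than parameters are available.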

Figure 6 illustrates a plot of equal error rate (EER) performances for the 20-
second training utterances and for 5-second utterances for the system of figure
5. For 5 seconds of training speech, using the correlation information, the EER
is reduced from 28.8% for 20 speakers to 20.4% for 400 speakers.
Correspondingly, the 20 second results indicated an improving performance
trend of 24.3% EER for 20 speakers down to 16.6% EER for 400 speakers. In
both instances the background model based system performance exceeded
that of the best covariance approximation system, giving a 14.8% EER. However,
it is to be noted that the background model based system error rates would be
outperformed by the covariance prior estimate system if more OOS speakers
were available, as the background model baseline covariance matrix is far from
becoming an accurate estimate of the true covariances.
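
For reference, an equal error rate can be estimated from a set of target trial scores and impostor trial scores as sketched below. This is a simple threshold sweep under the assumption that higher scores indicate the target speaker; the function name is hypothetical and the sketch is not the scoring tool used to produce figure 6.

```python
import numpy as np


def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER by sweeping a threshold over all observed scores."""
    target_scores = np.asarray(target_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    # Miss: a target trial scoring below the threshold.
    miss = np.array([(target_scores < t).mean() for t in thresholds])
    # False alarm: an impostor trial scoring at or above the threshold.
    false_alarm = np.array([(impostor_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(miss - false_alarm))
    return 0.5 * (miss[i] + false_alarm[i])
```

The returned value is the point at which the miss rate and false alarm rate are closest, averaged to allow for the finite number of trials.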

It is to be understood that the above embodiments have been provided
only by way of exemplification of this invention, and that further modifications
and improvements thereto, as would be apparent to persons skilled in the
relevant art, are deemed to fall within the broad scope and ambit of the present
invention described herein.

DATED THIS FIFTH DAY OF DECEMBER 2003 
QUEENSLAND UNIVERSITY OF TECHNOLOGY 
BY

PIZZEYS PATENT AND TRADE MARK ATTORNEY 